# 16. Quiz: Check Your Understanding
In this lesson, you learned about many different algorithms for Temporal-Difference (TD) control. Later in this nanodegree, you'll learn more about how to adapt the Q-Learning algorithm to produce the Deep Q-Learning algorithm that demonstrated superhuman performance at Atari games.
Before moving on, you're encouraged to check your understanding by completing this brief quiz on Q-Learning.

## The Agent and Environment
Imagine an agent that moves along a line with only five discrete positions (0, 1, 2, 3, or 4). The agent can move left, move right, or stay put. (If the agent chooses to move left at position 0 or right at position 4, it simply remains in place.)
The Q-table has:
- five rows, corresponding to the five possible states that may be observed, and
- three columns, corresponding to three possible actions that the agent can take in response.
The goal state is position 3, but the agent doesn't know that and is going to learn the best policy for getting to the goal via the Q-Learning algorithm (with learning rate \alpha=0.2). The environment will provide a reward of -1 for all locations except the goal state. The episode ends when the goal is reached.
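For concreteness, here is a minimal sketch (not part of the original lesson) of how this line world and its Q-table might be set up with NumPy. The action ordering, the `step` helper, and the assumption that reaching the goal yields a reward of 0 are illustrative choices; the lesson only specifies a reward of -1 for non-goal locations.

```python
import numpy as np

N_STATES = 5                         # positions 0, 1, 2, 3, 4
ACTIONS = ["left", "stay", "right"]  # column order is an assumption
GOAL_STATE = 3
ALPHA = 0.2                          # learning rate given in the lesson

# Q-table: five rows (states) by three columns (actions)
Q = np.zeros((N_STATES, len(ACTIONS)))

def step(state, action):
    """Hypothetical line-world dynamics matching the description above."""
    if action == "left":
        next_state = max(state - 1, 0)            # bumping the left wall: stay put
    elif action == "right":
        next_state = min(state + 1, N_STATES - 1) # bumping the right wall: stay put
    else:                                         # "stay"
        next_state = state
    done = (next_state == GOAL_STATE)             # episode ends at the goal
    reward = 0.0 if done else -1.0                # goal reward of 0 is an assumption
    return next_state, reward, done
```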
## Episode 0, Time 0
The Q-table is initialized with all values set to zero.

Say the agent observes the initial state (position 1) and selects action stay.
As a result, it receives the next state (position 1) and a reward (-1.0) from the environment.
Let:
- s_t denote the state at time step t,
- a_t denote the action at time step t, and
- r_t denote the reward at time step t.
The agent now knows s_0, a_0, r_1, and s_1. Namely, s_0 = 1, a_0 = \text{stay}, r_1 = -1.0, and s_1 = 1.
Using this information, it can update the Q-table value for Q(s_0, a_0).
Recall the equation for updating the Q-table (where \alpha is the learning rate and \gamma is the discount rate):

Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_a Q(s_{t+1}, a) \right)

Note that this is equivalent to the equation below (from the video on Q-Learning):

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right)

So the equation for updating Q(s_0, a_0) is:

Q(s_0, a_0) \leftarrow (1 - \alpha) Q(s_0, a_0) + \alpha \left( r_1 + \gamma \max_a Q(s_1, a) \right)

Substituting our known values:

Q(1, \text{stay}) \leftarrow (1 - 0.2) \cdot Q(1, \text{stay}) + 0.2 \cdot \left( -1.0 + \gamma \max_a Q(1, a) \right)
We can find the old value of Q(s_0, a_0) by looking it up in the table for state s_0 = 1 and action a_0 = \text{stay}, which gives a value of 0. To find the estimate of the optimal future value, \max_a Q(s_1, a), we look at the entire row of actions for the next state, s_1 = 1, and take the maximum value across all actions. They are all 0 right now, so the maximum is 0 and the discount term drops out. Reducing the equation, we can now update Q(s_0, a_0):

Q(1, \text{stay}) \leftarrow (1 - 0.2) \cdot 0 + 0.2 \cdot (-1.0 + 0) = -0.2
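As a quick check, here is the same single update in Python. The discount rate \gamma is not stated in this lesson, so \gamma = 1 is used as an assumption; it has no effect here because every entry in the next state's row is still zero.

```python
alpha, gamma = 0.2, 1.0   # gamma = 1.0 is an assumption; it cancels out below
reward = -1.0             # r_1

q_old = 0.0               # current Q(s_0, a_0) = Q(1, stay)
q_next_max = 0.0          # max over the row for s_1 = 1 (all entries are still 0)

q_new = (1 - alpha) * q_old + alpha * (reward + gamma * q_next_max)
print(q_new)              # -0.2
```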
## Episode 0, Time 1

At this step, an action must be chosen. The best action for position 1 could be either "left" or "right", since their values in the Q-table are equal.
Remember that in Q-Learning, the agent uses the epsilon-greedy policy to select an action. Say that in this case, the agent selects action right at random.
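A rough sketch of epsilon-greedy action selection over one row of the Q-table is shown below; the epsilon value and the random tie-breaking are illustrative choices, not something the lesson prescribes.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    """With probability epsilon pick a random action; otherwise pick greedily.
    Ties between equally valued actions (like "left" and "right" here) are
    broken at random rather than always taking the first maximal entry."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])           # explore
    best = np.flatnonzero(Q[state] == Q[state].max())  # indices of all maxima
    return np.random.choice(best)                      # exploit, ties at random
```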
Then, the agent receives a new state (position 2) and reward (-1.0) from the environment.
The agent now knows s_1, a_1, r_2, and s_2. Namely, s_1 = 1, a_1 = \text{right}, r_2 = -1.0, and s_2 = 2.
QUESTION:
What is the updated value for Q(s_1, a_1)? (round your answer to the nearest tenth)
SOLUTION:
-0.2
(As before, the old value Q(1, \text{right}) is 0 and the maximum over the row for the next state, position 2, is still 0, so the update is (1 - 0.2) \cdot 0 + 0.2 \cdot (-1.0 + 0) = -0.2.)
## Episode n
Now assume that a number of episodes have been run, and the Q-table includes the values shown below.
A new episode begins, as before. The environment gives an initial state (position 1), and the agent selects action stay.
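Whatever values the table contains, the update rule itself is unchanged. A small helper like the one below (with \gamma = 1 as an assumption, since the lesson does not give a discount rate) can be used to check your arithmetic once you read the old value and the next state's row off the table.

```python
def q_learning_update(q_old, reward, q_next_max, alpha=0.2, gamma=1.0):
    """One Q-learning update: (1 - alpha) * old estimate + alpha * TD target."""
    return (1 - alpha) * q_old + alpha * (reward + gamma * q_next_max)

# Example usage with the Episode 0, Time 0 numbers from earlier:
# q_learning_update(q_old=0.0, reward=-1.0, q_next_max=0.0)  -> -0.2
```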

QUESTION:
What is the new value for Q(1, \text{stay})? (round your answer to the nearest tenth)
SOLUTION: